VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning


Abstract

Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video with multiple temporal event locations in coherent storytelling. Following the human perception process, where a scene is effectively understood by decomposing it into visual (e.g. human, animal) and non-visual components (e.g. action, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global environment; (ii) local main agents; (iii) linguistic elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new contrastive loss function to guarantee that the learnt embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at: https://github.com/UARK-AICV/VLTinT.


Related resources

End-to-End Dense Video Captioning with Masked Transformer

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevent...


VXT: Visual XML Transformer

The ever growing amount of heterogeneous data exchanged through the Internet, combined with the popularity of XML, make structured document transformations an increasingly important application domain. Most of the existing solutions for expressing XML transformations are textual languages, such as XSLT or DOM combined with a general-purpose programming language. Several tools build on top of th...


Transformer proteins

Proteins are generally believed to adopt a unique fold, defined by their amino acid sequence, under specific environmental conditions. These unique structures, in turn, endow proteins with one specific function. However, not all proteins obey the “1 amino acid sequence → 1 fold → 1 function” scheme. Moonlighting proteins that adopt one distinct three-dimensional structure but can accomplish two ...


Transformer in Grid

The aim of this research article is to determine the way to install surge arresters close to a power transformer to provide protection against lightning overvoltage. Depending on the length of the cables used in the installation, the insulation levels in base insulators of surge arresters and bushings of transformers change according to the voltage they support. For validation purposes, the vol...



Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i3.25412